24 research outputs found

    PLuTO: MT for online patent translation

    PLuTO – Patent Language Translation Online – is a partially EU-funded commercialization project which specializes in the automatic retrieval and translation of patent documents. At the core of the PLuTO framework is a machine translation (MT) engine through which web-based translation services are offered. The fully integrated PLuTO architecture includes a translation engine coupling MT with translation memories (TM), and a patent search and retrieval engine. In this paper, we first describe the motivating factors behind the provision of such a service. Following this, we give an overview of the PLuTO framework as a whole, with particular emphasis on the MT components, and provide a real-world use case scenario in which PLuTO MT services are exploited.

    The application of morpho-syntactic language processing to effective information retrieval

    The fundamental function of an information retrieval system is to retrieve texts or documents from a database in response to a user’s request for information, such that the content of the retrieved documents will be relevant to the user’s original information need. This is accomplished through matching the user’s information request against the texts in the database in order to estimate which texts are relevant. In this thesis I propose a method for using current natural language processing techniques for the construction of a text representation to be used in an information retrieval system. In order to support this proposal I have designed a matching algorithm specifically for performing the retrieval task of matching user queries against texts in a database, using the proposed text representation. Having designed this text representation and matching algorithm, I then constructed an experiment to investigate the effectiveness of the algorithm at matching phrases. This experiment involved the use of standard statistical methods to compare the phrase matching capabilities of the proposed matching algorithm to a sample of information retrieval users performing the same task. The results of this evaluation experiment allow me to comment first of all on the effectiveness of the phrase matching algorithm that I have designed and, more generally, on the usefulness of incorporating natural language processing techniques into information retrieval systems.

    Using SMT for OCR error correction of historical texts

    With the advent of digital optical scanners, a trend to digitize historical paper-based archives has emerged in recent years. Many paper-based books, textbooks, magazines, articles, and documents are being transformed into electronic versions that can be manipulated by a computer. For this purpose, Optical Character Recognition (OCR) systems have been developed to transform scanned digital text into editable computer text. However, OCR output contains different kinds of errors, and Automatic Error Correction tools can help improve the quality of electronic texts by cleaning them and removing noise. In this paper, we perform a qualitative and quantitative comparison of several error-correction techniques for historical French documents. Experimentation shows that our Machine Translation for Error Correction method is superior to other Language Modelling correction techniques, with nearly 13% relative improvement compared to the initial baseline.

    Topic-dependent sentiment analysis of financial blogs

    While most work in sentiment analysis in the financial domain has focused on the use of content from traditional finance news, in this work we concentrate on a more subjective source of information: blogs. We aim to automatically determine the sentiment of financial bloggers towards companies and their stocks. To do this we develop a corpus of financial blogs, annotated with polarity of sentiment with respect to a number of companies. We conduct an analysis of the annotated corpus, from which we show there is a significant level of topic shift within this collection, and also illustrate the difficulty that human annotators have when annotating certain sentiment categories. To deal with the problem of topic shift within blog articles, we propose text extraction techniques to create topic-specific sub-documents, which we use to train a sentiment classifier. We show that such approaches provide a substantial improvement over full document classification and that word-based approaches perform better than sentence-based or paragraph-based approaches.

    Exploring the use of paragraph-level annotations for sentiment analysis of financial blogs

    In this paper we describe our work in the area of topic-based sentiment analysis in the domain of financial blogs. We explore the use of paragraph-level and document-level annotations, examining how additional information from paragraph-level annotations can be used to increase the accuracy of document-level sentiment classification. We acknowledge the additional effort required to provide these paragraph-level annotations, and so we compare these findings against an automatic means of generating topic-specific sub-documents.

    Progress of the PRINCIPLE project: promoting MT for Croatian, Icelandic, Irish and Norwegian

    This paper updates the progress made on the PRINCIPLE project, a 2-year action funded by the European Commission under the Connecting Europe Facility (CEF) programme. PRINCIPLE focuses on collecting high-quality language resources for Croatian, Icelandic, Irish and Norwegian, which have been identified as low-resource languages, especially for building effective machine translation (MT) systems. We report initial achievements of the project and ongoing activities aimed at promoting the uptake of neural MT for the low-resource languages of the project.

    Achievements of the PRINCIPLE project: promoting MT for Croatian, Icelandic, Irish and Norwegian

    This paper provides an overview of the main achievements of the completed PRINCIPLE project, a 2-year action funded by the European Commission under the Connecting Europe Facility (CEF) programme. PRINCIPLE focused on collecting high-quality language resources for Croatian, Icelandic, Irish and Norwegian, which are severely low-resource languages, especially for building effective machine translation (MT) systems. We report the achievements of the project, primarily in terms of the large amounts of data collected for all four low-resource languages and of promoting the uptake of neural MT (NMT) for these languages.

    Cross-Language Information Retrieval in a Multilingual Legal Domain

    We describe here the application of a cross-language information retrieval technique based on similarity thesauri in the domain of Swiss law. We present the theory of similarity thesauri, which are information structures derived from corpora, and show how they can be used for cross-language retrieval. We also discuss the collections of Swiss legal documents and show how we have used them to construct an environment in which we can directly evaluate the performance of our cross-language retrieval system. Evaluation shows that cross-language retrieval works as well as monolingual retrieval in the best case. We conclude that providing cross-language access to digital libraries is already a viable possibility. 1 Introduction One of the great benefits of digital libraries is the ability to make information available to a wide audience without geographic constraints. As large-scale digital libraries contribute to the dismantling of the geographic barriers to information access ho..